Abstract

We survey classical kernel methods for providing nonparametric solu- 
tions to problems involving measurement error. In particular we outline 
kernel-based methodology in this setting, and discuss its basic properties. 
Then we point to close connections that exist between kernel methods 
and much newer approaches based on minimum contrast techniques. The 
connections are through use of the sinc kernel for kernel-based inference.
This 'infinite order' kernel is not often used explicitly for kernel-based 
deconvolution, although it has received attention in more conventional 
problems where measurement error is not an issue. We show that in 
a comparison between kernel methods for density deconvolution, and 
their counterparts based on minimum contrast, the two approaches give 
identical results on a grid which becomes increasingly fine as the band- 
width decreases. In consequence, the main numerical differences between 
these two techniques are arguably the result of different approaches to 
choosing smoothing parameters.

1 Introduction



1.1 Summary 

Our aim in this paper is to give a brief survey of kernel methods for 
solving problems involving measurement error, for example problems 
involving density deconvolution or regression with errors in variables, 
and to relate these 'classical' methods (they are now about twenty years 
old) to new approaches based on minimum contrast methods. Section 1.2
motivates the treatment of problems involving errors in variables, and
section 1.3 describes conventional kernel methods for problems where the
extent of measurement error is so small as to be ignorable. Section 2.1 
shows how those standard techniques can be modified to take account 
of measurement errors, and section 2.2 outlines theoretical properties of 
the resulting estimators. 

In section 3 we show how kernel methods for dealing with measure- 
ment error are related to new techniques based on minimum contrast 
ideas. For this purpose, in section 3.1 we specialise the work in section 2
to the case of the sinc kernel. That kernel choice is not widely used for
density deconvolution, although it has previously been studied in that
context by Stefanski and Carroll (1990), Diggle and Hall (1993), Barry
and Diggle (1995), Butucea (2004), Meister (2004) and Butucea and
Tsybakov (2007a, b). Section 3.1 also outlines some of the known
properties of sinc kernel estimators, and section 3.2 points to the very close
connection between that approach and minimum contrast, or penalised
contrast, methods.

1.2 Errors in variables 

Measurement errors arise commonly in practice, although only in a 
minority of statistical analyses is a special effort made to accommodate 
them. Often they are minor, and ignoring them makes little difference, 
but in some problems they are important and significant, and we neglect 
them at our peril. 

Areas of application of deconvolution, and regression with measure- 
ment error, include the analysis of seismological data (e.g. Kragh and 
Laws, 2006), financial analysis (e.g. Bonhomme and Robin, 2008), dis- 
ease epidemiology (e.g. Brookmeyer and Gail, 1994, Chapter 8), and 
nutrition. 

The last of these topics is of particular interest today, for example in
connection with errors-in-variables problems for data gathered in food
frequency questionnaires (FFQs), or dietary questionnaires for
epidemiological studies (DQESs). Formally, an FFQ is 'A method of dietary
assessment in which subjects are asked to recall how frequently certain 
foods were consumed during a specified period of time,' according to the 
Nutrition Glossary of the European Food Information Council. An FFQ 
seeks detailed information about the nature and quantity of food eaten 
by the person filling in the form, and often includes a query such as, 
"How many of the above servings are from fast food outlets (McDon- 
alds, Taco Bell, etc.)?" (Stanford University, 1994). This may seem a 
simple question to answer, but nutritionists interested in our consump- 
tion of fat generally find that the quantity of fast food that people admit 
to eating is biased downwards from its true value. The significant
concerns in Western society about fat intake, and about where we purchase
our oleaginous food, apparently influence our truthfulness when we are
asked probing questions about our eating habits.

Examples of the use of statistical deconvolution in this area include the 
work of Stefanski and Carroll (1990) and Delaigle and Gijbels (2004b), 
who address nonparametric density deconvolution from measurement- 
error data, obtained from FFQs during the second National Health and 
Nutrition Examination Survey (1976-1980); Carroll et al. (1997), who 
discuss design and analysis aspects of linear measurement-error models 
when data come from FFQs; Carroll et al. (2006), who use measurement- 
error models, and deconvolution methods, to develop marginal mixed 
measurement-error models for each nutrient in a nutrition study, again 
when FFQs are used to supply the data; and Staudenmayer et al. (2008), 
who employ a dataset from nutritional epidemiology to illustrate the use 
of techniques for nonparametric density deconvolution. See Carroll et 
al. (2006, p. 7) for further discussion of applications to data on nutrition. 

How might we correct for errors in variables? One approach is to
use methods based on deconvolution, as follows. Let us write $Q$ for the
quantity of fast food that a person admits to eating, in a food frequency
questionnaire; let $Q_0$ denote the actual amount of fast food; and put
$R = Q/Q_0$. We expect that the distribution of $R$ will be skewed towards
values greater than 1, and we might even have an idea of the shape of
the distribution responsible for this effect, i.e. the distribution of $\log R$.
Indeed, we typically work with the logarithm of the formula $Q = Q_0 R$,
and in that context, writing $W = \log Q$, $X = \log Q_0$ and $U = \log R$, the
equation defining the variables of interest is:

$$W = X + U. \tag{1.1}$$

We have data on W, and from that we wish to estimate the distribution
of X, i.e. the distribution of the logarithm of fast-food consumption.

It can readily be seen that this problem is generally not solvable unless
the distribution of U, and the joint distribution of X and U, are known.
In practice we usually take X and U to be independent, and undertake
empirical deconvolution (i.e. estimation of the distribution, or density,
of X from data on W) for several candidates for the distribution of U.
If we are able to make repeated measurements of X, in particular to
gather data on $W^{(j)} = X + U^{(j)}$ for $1 \le j \le m$, say, then we have an
opportunity to estimate the distribution of U as well.

It is generally reasonable to assume that $X, U^{(1)}, \ldots, U^{(m)}$ are
independent random variables. The distribution of U can be estimated
whenever $m \ge 2$ and the distribution is uniquely determined by $|\phi_U|^2$,
where $\phi_U$ denotes the characteristic function of U. The simplest example
of this type is arguably that where U has a symmetric distribution for 
which the characteristic function does not vanish on the real line. One 
example of repeated measurements in the case m = 2 is that where a 
food frequency questionnaire asks at one point how many times we vis- 
ited a fast food outlet, and on a distant page, how many hamburgers or 
servings of fried chicken we have purchased. 
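
To make the replicate idea concrete, the following is a minimal Python sketch for the case m = 2. It assumes, purely for illustration, that U is symmetric with a strictly positive characteristic function, so that $\phi_U$ can be recovered as the square root of $|\phi_U|^2$. The difference $W^{(1)} - W^{(2)} = U^{(1)} - U^{(2)}$ does not involve X and has characteristic function $|\phi_U|^2$. The function name `estimate_phi_u`, the Laplace error distribution and the sample size are hypothetical choices, not taken from the paper.

```python
import numpy as np

def estimate_phi_u(w1, w2, t):
    """Estimate phi_U(t) from paired replicates W1 = X + U1, W2 = X + U2.

    Assumes (for illustration only) that U is symmetric with a positive
    characteristic function, so that phi_U = sqrt(|phi_U|^2).
    """
    d = np.asarray(w1) - np.asarray(w2)           # equals U1 - U2; X cancels
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Empirical characteristic function of the difference. Only the real part
    # is needed because the distribution of U1 - U2 is symmetric.
    phi_sq = np.mean(np.cos(np.outer(t, d)), axis=1)
    return np.sqrt(np.clip(phi_sq, 0.0, None))    # crude regularisation at zero

rng = np.random.default_rng(0)
n, sigma = 5000, 0.4
x = rng.normal(size=n)
u1 = rng.laplace(scale=sigma, size=n)
u2 = rng.laplace(scale=sigma, size=n)
t_grid = np.array([0.0, 0.5, 1.0, 2.0])
phi_hat = estimate_phi_u(x + u1, x + u2, t_grid)
phi_true = 1.0 / (1.0 + sigma**2 * t_grid**2)     # Laplace characteristic function
```

In this simulated setting `phi_hat` can be compared directly with `phi_true`; with real data only the estimate is available, and some regularisation for large |t| would normally be needed.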

The model at (1.1) is simple and interesting, but in examples from
nutrition science, and in many other problems, we generally wish to 
estimate the response to an explanatory variable, rather than the dis- 
tribution of the explanatory variable. Therefore the proper context for 
our food frequency questionnaire example is really regression, not dis- 
tribution or density estimation. In regression with errors in variables we 
observe data pairs (W, Y), where 

$$W = X + U, \qquad Y = g(X) + V, \tag{1.2}$$

$g(x) = E(Y \mid X = x)$, and the random variable V, denoting an
experimental error, has zero mean. In this case the standard regression
problem is altered on account of errors that are incurred when
measuring the value of the explanatory variable. In (1.2) the variables U, V
and X are assumed to be independent.

The measurement error U, appearing in (1.1) and (1.2), can be
interpreted as the result of a 'laboratory error' in determining the 'dose'
X which is applied to the subject. For example, a laboratory technician 
might use the dose X in an experiment, but in attempting to determine 
the dose after the experiment they might commit an error U, with the 
result that the actual dose is recorded as X + U instead of X. Another
way of modelling the effect of measurement error is to reverse the roles
of X and W, so that we observe (W, Y) generated as 

$$X = W + U, \qquad Y = g(X) + V. \tag{1.3}$$

Here a precise dose W is specified, but when measuring it prior to the 
experiment our technician commits an error U, with the result that 
the actual dose is W + U. In (1.3) it is assumed that U, V and W are
independent. 

The measurement error model (1.2) is standard. The alternative model
(1.3) is believed to be much less common, although in some
circumstances it is difficult to determine which of (1.2) and (1.3) is the more
appropriate. The model at (1.3) was first suggested by Berkson (1950),
for whom it is named. 



1.3 Kernel methods 

If the measurement error U were very small then we could estimate the
density f of X, and the function g in the model (1.2), using standard
kernel methods. For example, given data $X_1, \ldots, X_n$ on X we could
take

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big) \tag{1.4}$$

to be our estimator of f(x). Here K is a kernel function and h, a positive
quantity, is a bandwidth. Likewise, given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ on
(X, Y) we could take

$$\hat g(x) = \frac{\sum_{i=1}^n Y_i\, K\{(x - X_i)/h\}}{\sum_{i=1}^n K\{(x - X_i)/h\}} \tag{1.5}$$

to be our estimator of g(x), where g is as in the model at (1.2).
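
The estimators at (1.4) and (1.5) can be sketched in a few lines of Python. The Gaussian kernel, the function names `kde` and `nadaraya_watson`, and the simulated data are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimator (1.4) evaluated at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Local constant estimator (1.5) evaluated at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    w = kernel((x[:, None] - xs[None, :]) / h)
    return (w * ys[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(1)
xs = rng.normal(size=2000)                      # error-free observations of X
ys = np.sin(xs) + 0.1 * rng.normal(size=2000)   # Y = g(X) + V with g = sin
f_hat = kde([0.0], xs, h=0.3)
g_hat = nadaraya_watson([0.5], xs, ys, h=0.2)
```

Because the Gaussian kernel is itself a density, `kde` returns a function that is nonnegative and integrates to 1, in line with the remark that the estimator is a probability density when K is one.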

The estimator at (1.4) is a standard kernel density estimator, and is
itself a probability density if we take K to be a density. It is consistent
under particularly weak conditions, for example if f is continuous and
$h \to 0$ and $nh \to \infty$ as n increases. Density estimation is discussed at
length by Silverman (1986) and Scott (1992). The estimator $\hat g$, which we
generally also compute by taking K to be a probability density, is often
referred to as the 'local constant' or Nadaraya-Watson estimator of g.
The first of these names follows from the fact that $\hat g(x)$ is the result of
fitting a constant to the data by local least squares:

$$\hat g(x) = \operatorname*{argmin}_{c} \sum_{i=1}^n (Y_i - c)^2\, K\Big(\frac{x - X_i}{h}\Big). \tag{1.6}$$

The estimator $\hat g$ is also consistent under mild conditions, for example if
the variance of the error, V, in (1.2) is finite, if f and g are continuous, if
f > 0 at the point x where we wish to estimate g, and if $h \to 0$ and $nh \to \infty$
as n increases. General kernel methods are discussed by Wand and
Jones (1995), and statistical smoothing is addressed by Simonoff (1996).

Local constant estimators have the advantage of being relatively
robust against uneven spacings in the sequence $X_1, \ldots, X_n$. For example,
the ratio at (1.5) never equals a nonzero number divided by zero.
However, local constant estimators are particularly susceptible to boundary
bias. In particular, if the density of X is supported and bounded away
from zero on a compact interval, then $\hat g$, defined by (1.5) or (1.6), is
generally inconsistent at the endpoints of that interval. Issues of this
type have motivated the use of local polynomial estimators, which are
defined by $\hat g(x) = \hat c_0(x)$ where, in a generalisation of (1.6),

$$\big(\hat c_0(x), \ldots, \hat c_p(x)\big) = \operatorname*{argmin}_{(c_0, \ldots, c_p)} \sum_{i=1}^n \Big\{ Y_i - \sum_{j=0}^p c_j\, (x - X_i)^j \Big\}^2 K\Big(\frac{x - X_i}{h}\Big). \tag{1.7}$$

See, for example, Fan and Gijbels (1996). In (1.7), p denotes the degree
of the locally fitted polynomial. The estimator $\hat g(x) = \hat c_0(x)$, defined
by (1.7), is also consistent under the conditions given earlier for the
estimator defined by (1.5) and (1.6).

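In the particular case p = 1 we obtain a local-linear estimator of g(x):

$$\hat g(x) = \frac{S_2(x)\, T_0(x) - S_1(x)\, T_1(x)}{S_0(x)\, S_2(x) - S_1(x)^2}, \tag{1.8}$$

where

$$S_r(x) = \frac{1}{nh} \sum_{i=1}^n \Big(\frac{x - X_i}{h}\Big)^r K\Big(\frac{x - X_i}{h}\Big), \qquad T_r(x) = \frac{1}{nh} \sum_{i=1}^n Y_i \Big(\frac{x - X_i}{h}\Big)^r K\Big(\frac{x - X_i}{h}\Big), \tag{1.9}$$

h denotes a bandwidth and K is a kernel function.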
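
As a hedged illustration of (1.8) and (1.9), the sketch below computes the sums $S_r$ and $T_r$ directly and forms the local-linear estimate. The Gaussian kernel and the function name `local_linear` are assumptions made for the example. A useful sanity check is that a local-linear fit reproduces exactly linear data, including at the boundary of the design.

```python
import numpy as np

def local_linear(x, xs, ys, h):
    """Local-linear estimator (1.8), built from the sums S_r and T_r in (1.9)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (x[:, None] - xs[None, :]) / h
    k = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel (an assumption)
    nh = len(xs) * h
    s0, s1, s2 = (np.sum(z**r * k, axis=1) / nh for r in range(3))
    t0, t1 = (np.sum(ys[None, :] * z**r * k, axis=1) / nh for r in range(2))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1**2)

# On exactly linear data the fit returns the line itself, even at the endpoints.
xs = np.linspace(0.0, 1.0, 50)
ys = 2.0 + 3.0 * xs
fit = local_linear([0.0, 0.25, 1.0], xs, ys, h=0.1)
```

The exact reproduction of linear functions at the endpoints 0 and 1 is the property that protects local-linear estimators against the boundary bias affecting the local constant estimator.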

Estimators of all these types can be quickly extended to cases where
errors in variables are present, for example as in the models at (1.1)
and (1.2), simply by altering the kernel function K so that it acts to
cancel out the influence of the errors. We shall give details in section 2.
Section 3 will discuss recently introduced methodology which, from some
viewpoints, looks quite different from, but is actually almost identical to,
kernel methods.

2 Methodology and theory 

2.1 Definitions of estimators 

We first discuss a generalisation of the estimator at (1.4) to the case
where there are errors in the observations of $X_i$, as per the model at (1.1).
In particular, we assume that we observe data $W_1, \ldots, W_n$ which are
independent and identically distributed as W = X + U, where X and
U are independent and the distribution of U has known characteristic
function $\phi_U$ which does not vanish anywhere on the real line. Let K be
a kernel function, write $\phi_K(t) = \int e^{itx} K(x)\, dx$ for the associated Fourier
transform, and define

$$K_U(x) = \frac{1}{2\pi} \int e^{-itx}\, \frac{\phi_K(t)}{\phi_U(t/h)}\, dt. \tag{2.1}$$

Then, to construct an estimator of the density $f = f_X$ of X, when all
we observe are the contaminated data $W_1, \ldots, W_n$, we simply replace
K by $K_U$, and $X_i$ by $W_i$, in the definition of $\hat f$ at (1.4), obtaining the
estimator

$$\hat f_{\rm decon}(x) = \frac{1}{nh} \sum_{i=1}^n K_U\Big(\frac{x - W_i}{h}\Big). \tag{2.2}$$

Here the subscript 'decon' signifies that $\hat f_{\rm decon}$ involves empirical
deconvolution. The adjustment to the kernel takes care of the measurement
error, and results in consistency in a wide variety of settings. Likewise,
if data pairs $(W_1, Y_1), \ldots, (W_n, Y_n)$ are generated under the model at
(1.2) then, to construct the local constant estimator at (1.5), or the local
linear estimator defined by (1.8) and (1.9), all we do is replace each $X_i$
by $W_i$, and K by $K_U$. Other local polynomial estimators can be
calculated using a similar rule, replacing $h^{-r} (x - X_i)^r K\{(x - X_i)/h\}$ in $S_r$
and $T_r$ by $K_{U,r}\{(x - W_i)/h\}$, where

$$K_{U,r}(x) = \frac{1}{2\pi i^r} \int e^{-itx}\, \frac{\phi_K^{(r)}(t)}{\phi_U(t/h)}\, dt.$$

The estimator at (2.2) dates from work of Carroll and Hall (1988) and
Stefanski and Carroll (1990). Deconvolution-kernel regression estimators
in the local-constant case were developed by Fan and Truong (1993), and
extended to the general local polynomial setting by Delaigle et al. (2009).
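
The construction of the deconvolution kernel density estimator can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the errors are taken to be Laplace with scale `sigma`, so that $1/\phi_U(t) = 1 + \sigma^2 t^2$; the kernel has $\phi_K(t) = (1 - t^2)^3$ on [-1, 1]; and the integral defining $K_U$ is evaluated by the trapezoid rule. The function names are hypothetical.

```python
import numpy as np

def deconvolution_kernel(x, h, sigma, n_t=201):
    """K_U for Laplace(sigma) errors and phi_K(t) = (1 - t^2)^3 on [-1, 1].

    Both choices are illustrative assumptions. By symmetry the integral is
    (1/pi) * int_0^1 cos(t x) phi_K(t) / phi_U(t/h) dt (trapezoid rule).
    """
    t = np.linspace(0.0, 1.0, n_t)
    phi_k = (1.0 - t**2) ** 3
    inv_phi_u = 1.0 + (sigma * t / h) ** 2         # 1 / phi_U(t/h), Laplace errors
    integrand = np.cos(np.multiply.outer(x, t)) * (phi_k * inv_phi_u)
    dt = t[1] - t[0]
    total = integrand.sum(axis=-1) - 0.5 * (integrand[..., 0] + integrand[..., -1])
    return dt * total / np.pi

def f_decon(x, w, h, sigma):
    """Deconvolution kernel density estimator evaluated at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - w[None, :]) / h
    return deconvolution_kernel(u, h, sigma).sum(axis=1) / (len(w) * h)

rng = np.random.default_rng(2)
n, sigma = 400, 0.3
w = rng.normal(size=n) + rng.laplace(scale=sigma, size=n)   # W = X + U
grid = np.linspace(-10.0, 10.0, 81)
est = f_decon(grid, w, h=0.3, sigma=sigma)
```

The estimate integrates to approximately 1 over the grid because $\phi_K(0) = 1$, but it is not constrained to be nonnegative, so small negative values in the tails are to be expected.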

The kernel $K_U$ is deliberately constructed to be the function whose
Fourier transform is $\phi_K/\phi_U$. This adjustment permits cancellation of the
influence of errors in variables, as discussed at the end of section 1.3. To
simplify calculations, for example computation of the integral in (2.1),
we generally choose K not to be a density function but to be a smooth,
symmetric function for which $\phi_K$ vanishes outside a compact interval.
The commonly-used candidates for $\phi_K$ are proportional to functions that
are used for K, rather than $\phi_K$, in the case of regular kernel estimation
discussed in section 1.3. For example, kernels K for which
$\phi_K(t) = (1 - |t|^r)^s$ for $|t| \le 1$, and $\phi_K(t) = 0$ otherwise, are common; here r and s
are integers. Taking $r = 2s = 2$, $r = s = 2$ and $r = \tfrac{2}{3} s = 2$ corresponds
to the Fourier inverses of the biweight, quartic and triweight kernels,
respectively. Taking s = 0 gives the inverse of the uniform kernel, i.e. the
sinc kernel, which we shall meet again in section 3. Further information
about kernel choice is given by Delaigle and Hall (2006).

These kernels, and others, have the property that $\phi_K(t) = 1$ when t = 0,
thereby guaranteeing that $\int K = 1$. The latter condition ensures that
the density estimator, defined at (2.2) and constructed using this kernel,
integrates to 1. (However, the estimator defined by (2.2) will generally
take negative values at some points x.) The normalisation property is not
so important when the kernel is used to construct regression estimators,
where the effects of multiplying K by a constant factor cancel from
the 'deconvolution' versions of formulae (1.5) and (1.8), and likewise
vanish for all deconvolution-kernel estimators based on local polynomial
methods.

Note that, as long as $\phi_K$ and $\phi_U$ are supported either on the whole real
line or on a symmetric compact domain, the kernel $K_U$, defined by (2.1),
and its generalised form $K_{U,r}$, are real-valued. Indeed, using properties
of the complex conjugate of Fourier transforms of real-valued functions,
and the change of variable $u = -t$, we have, using the notation $\bar a(t)$ for
the complex conjugate of a complex-valued function a of a real variable
t,

$$\overline{K_{U,r}(x)} = \frac{1}{2\pi (-i)^r} \int e^{itx}\, \frac{\overline{\phi_K^{(r)}(t)}}{\overline{\phi_U(t/h)}}\, dt = \frac{1}{2\pi i^r} \int e^{itx}\, \frac{\phi_K^{(r)}(-t)}{\phi_U(-t/h)}\, dt = \frac{1}{2\pi i^r} \int e^{-iux}\, \frac{\phi_K^{(r)}(u)}{\phi_U(u/h)}\, du = K_{U,r}(x).$$

In practice it is almost always the case that the distribution of U is 
symmetric, and in the discussion of variance in section 2.2, below, we 
shall make this assumption. We shall also suppose that K is symmetric, 
again a condition which holds almost invariably in practice. 

The estimators discussed above were based on the assumption that
the characteristic function $\phi_U$ of the errors in variables is known. This
enabled us to compute the deconvolution kernel $K_U$ at (2.1). In cases
where the distribution of U is not known, but can be estimated from
replicated data (see section 1.2), we can replace $\phi_U$ by an estimator of it
and, perhaps after a little regularisation, compute an empirical version
of $K_U$. This can give good results, in both theory and practice. In
particular, in many cases the resulting estimator of the density of X, or the
regression mean g, can be shown to have the same first-order properties 
as estimators computed under the assumption that the distribution of 
U is known. Details are given by Delaigle et al. (2008). 

Methods for choosing the smoothing parameter, h, in the estimat- 
ors discussed above have been proposed by Hesse (1999), Delaigle and 
Gijbels (2004a,b) and Delaigle and Hall (2008). 



2.2 Bias and variance 

The expected value of the estimator at (2.2) equals

$$\begin{aligned}
E\{\hat f_{\rm decon}(x)\} &= \frac{1}{2\pi h} \int E\big\{e^{-it(x - W)/h}\big\}\, \frac{\phi_K(t)}{\phi_U(t/h)}\, dt \\
&= \frac{1}{2\pi h} \int e^{-itx/h}\, \phi_W(t/h)\, \frac{\phi_K(t)}{\phi_U(t/h)}\, dt \\
&= \frac{1}{2\pi} \int e^{-itx}\, \phi_K(ht)\, \phi_X(t)\, dt \\
&= \frac{1}{h} \int K(u/h)\, f(x - u)\, du \\
&= E\{\hat f(x)\},
\end{aligned} \tag{2.3}$$

where the first equality uses the definition of $K_U$, the third uses the
factorisation $\phi_W = \phi_X \phi_U$ together with a change of variable, and the fourth
equality uses Plancherel's identity. Therefore the deconvolution estimator
$\hat f_{\rm decon}(x)$, calculated from data contaminated by measurement errors,
has exactly the same mean, and therefore the same bias, as $\hat f(x)$, which
would be computed using values of $X_i$ observed without measurement
error. This confirms that using the deconvolution kernel estimator does
indeed allow for cancellation of measurement errors, at least in terms of
their presence in the mean.

Of course, variance is a different matter. Since $\hat f_{\rm decon}(x)$ equals a sum
of independent random variables,

$$\operatorname{var}\{\hat f_{\rm decon}(x)\} \sim (nh)^{-1} f_W(x) \int K_U^2 = \frac{f_W(x)}{2\pi nh} \int \phi_K(t)^2\, |\phi_U(t/h)|^{-2}\, dt. \tag{2.4}$$

(Here the relation $\sim$ means that the ratio of the left- and right-hand
sides converges to 1 as $h \to 0$.) Thus it can be seen that the variance
of $\hat f_{\rm decon}(x)$ depends intimately on tail behaviour of the characteristic
function $\phi_U$ of the measurement-error distribution.

If $\phi_K$ vanishes outside a compact set, which, as we noted in section 2.1,
is generally the case, and if $|\phi_U|$ is asymptotic to a positive regularly
varying function $\psi$ (see Bingham et al., 1989), in the sense that
$|\phi_U(t)| \asymp \psi(t)$ (meaning that the ratio of both sides is bounded away from zero
and infinity as $t \to \infty$), then the integral on the right-hand side of
(2.4) is bounded between two constant multiples of $\psi(1/h)^{-2}$ as $h \to 0$.
Therefore by (2.4), provided that $f_W(x) > 0$,

$$\operatorname{var}\{\hat f_{\rm decon}(x)\} \asymp (nh)^{-1}\, \psi(1/h)^{-2} \tag{2.5}$$

as n increases and h decreases. Recall that we are assuming that $f_U$ and
K are both symmetric functions.

If the density f of X has two bounded and continuous derivatives, and
if K is bounded and symmetric and satisfies $\int x^2 |K(x)|\, dx < \infty$, then
the bias of $\hat f_{\rm decon}$ can be found from (2.3), using elementary calculus and
arguments familiar in the case of standard kernel estimators:

$$\begin{aligned}
{\rm bias}(x) &= E\{\hat f_{\rm decon}(x)\} - f(x) = E\{\hat f(x)\} - f(x) \\
&= \int K(u)\, \{f(x - hu) - f(x)\}\, du = \tfrac{1}{2} h^2 \kappa f''(x) + o(h^2)
\end{aligned} \tag{2.6}$$

as $h \to 0$, where $\kappa = \int x^2 K(x)\, dx$. Therefore, provided that $f''(x) \ne 0$,
the bias of the conventional kernel estimator $\hat f(x)$ is exactly of size $h^2$ as
$h \to 0$. Combining this property, (2.3) and (2.5) we deduce a relatively
concise asymptotic formula for the mean squared error of $\hat f_{\rm decon}(x)$:

$$E\{\hat f_{\rm decon}(x) - f(x)\}^2 \asymp h^4 + (nh)^{-1}\, \psi(1/h)^{-2}. \tag{2.7}$$

For a given error distribution we can work out the behaviour of $\psi(1/h)$
as $h \to 0$, and then from (2.7) we can calculate the optimal bandwidth
and determine the exact rate of convergence of $\hat f_{\rm decon}(x)$ to f(x), in mean
square. In many instances this rate is optimal, in a minimax sense; see,
for example, Fan (1991). It is also generally optimal in the case of the
errors-in-variables regression estimators discussed in section 2.1, based
on deconvolution-kernel versions of local polynomial estimators. See Fan
and Truong (1993).
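
As a worked illustration of the bandwidth calculation, suppose that $\psi(t) = t^{-\alpha}$ for large t, so that the two terms in (2.7) are of orders $h^4$ and $n^{-1} h^{-(2\alpha + 1)}$. The Python sketch below minimises $a h^4 + b\, n^{-1} h^{-(2\alpha+1)}$ in closed form. The constants `a` and `b` are hypothetical, since (2.7) determines only orders of magnitude, and the function names are chosen for the example.

```python
import numpy as np

def optimal_bandwidth(n, alpha, a=1.0, b=1.0):
    """Minimiser over h of a*h^4 + b*h^-(2*alpha+1)/n.

    Setting the derivative to zero gives h^(2*alpha+5) = (2*alpha+1)*b / (4*a*n).
    The constants a and b are hypothetical placeholders.
    """
    return ((2 * alpha + 1) * b / (4 * a * n)) ** (1.0 / (2 * alpha + 5))

def mse_bound(h, n, alpha, a=1.0, b=1.0):
    """Squared-bias term plus variance term, up to the assumed constants."""
    return a * h**4 + b * h ** (-(2 * alpha + 1)) / n

n, alpha = 10_000, 2.0           # alpha = 2 matches the tails of a Laplace error
h_star = optimal_bandwidth(n, alpha)
rate = mse_bound(h_star, n, alpha)
```

Under these assumptions the optimal bandwidth is of size $n^{-1/(2\alpha + 5)}$ and the resulting mean squared error is of size $n^{-4/(2\alpha + 5)}$, so heavier smoothing by the error distribution (larger $\alpha$) leads to a slower rate.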

Therefore, despite their almost naive simplicity, deconvolution-kernel 
estimators of densities and regression functions have features that can 
hardly be bettered by more complex, alternative approaches. The results
derived in the previous paragraph, and their counterparts in the
regression case, imply that the estimators are limited by the extent to which
they can recover information from the data. (This is reflected in the fact that the
rate of decay of the tails of $\phi_U$ drives the results on convergence rates.)
However, the fact that the estimators are nevertheless optimal, in terms 
of their rates of convergence, implies that this restriction is inherent to 
the problem, not just to the estimators; no other estimators would have 
a better convergence rate, at least not uniformly in a class of problems. 



3 Relationship to minimum contrast methods 

3.1 Deconvolution kernel estimators based on the sinc kernel

The sinc, or Fourier integral, kernel is given by

$$L(x) = \begin{cases} \dfrac{\sin(\pi x)}{\pi x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \tag{3.1}$$

Its Fourier transform, defined as a Riemann integral, is the 'boxcar
function', $\phi_L(t) = 1$ if $|t| \le \pi$ and $\phi_L(t) = 0$ otherwise. In particular, $\phi_L$
vanishes outside a compact set, which property, as we noted in section 2.1,
aids computation. The version of $K_U$, at (2.1), for the sinc kernel is

$$L_U(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{e^{-itx}}{\phi_U(t/h)}\, dt = \frac{1}{\pi} \int_0^{\pi} \frac{\cos(tx)}{\phi_U(t/h)}\, dt,$$

where the second identity holds if the distribution of U is symmetric and
its characteristic function has no zeros on the real line.

The kernel L is sometimes said to be of 'infinite order', in the sense
that if a is any function with an infinite number of bounded, integrable
derivatives then

$$\int \Big[ \int \{a(x + hu) - a(x)\}\, L(u)\, du \Big]^2 dx = O(h^r) \tag{3.2}$$

as $h \to 0$, for all r > 0. If K were of finite order then (3.2) would hold
only for a finite range of values of r, no matter how many derivatives
the function a enjoyed. For example, if K were a symmetric function for
which $\int u^2 K(u)\, du \ne 0$, and if we were to replace L in (3.2) by K, then
(3.2) would hold only for $r \le 4$, not for all r. In this case we would say
that K was of second order, because

$$\int \{a(x + hu) - a(x)\}\, K(u)\, du = O(h^2).$$

If we take a to be the density, f, of the random variable X, and take K
in the definition of $\hat f$ at (1.4) to be the sinc kernel, L, then the left-hand
side of (3.2) equals the integral of the squared bias of $\hat f$. Therefore, in the
case of a very smooth density, the 'infinite order' property of the sinc kernel
ensures particularly small bias, in an average sense.

Properties of conventional kernel density estimators, but founded on
the sinc kernel, for data without measurement errors, have been
studied by, for example, Davis (1975, 1977). Glad et al. (1999) have provided
a good survey of properties of sinc kernel methods for density
estimation, and have argued that those estimators have received an unfairly
bad press. Despite criticism of sinc kernel estimators (see e.g. Politis
and Romano, 1999), the approach is "more accurate for quite moderate
values of the sample size, has better asymptotics in non-smooth cases
(the density to be estimated has only first derivative), [and] is more
convenient for bandwidth selection etc" than its conventional competitors,
suggest Glad et al. (1999).

The property of greater accuracy is borne out in both theoretical and
numerical studies, and derives from the infinite-order property noted
above. Indeed, if f is very smooth then the low level of average squared
bias can be exploited to produce an estimator $\hat f$ with particularly low
mean squared error, in fact of order $n^{-1}$ in some cases. The most easily
seen disadvantage of sinc-kernel density estimators is their tendency to
suffer from spurious oscillations, inherited from the infinite number of
oscillations of the kernel itself.

These properties can be expected to carry over to density and
regression estimators based on contaminated data, when we use the sinc
kernel. To give a little detail in the case of density estimation from data
contaminated by measurement errors, we note that if the density f of
X is infinitely differentiable, but we observe only the contaminated data
$W_1, \ldots, W_n$ distributed as W, generated as at (1.1); if we use the
density estimator at (2.2), but computed using K = L, the sinc kernel; and
if $|\phi_U(t)| \ge C\, (1 + |t|)^{-\alpha}$ for constants $C, \alpha > 0$; then, in view of (2.4)
and (3.2), we have for all r > 0,

$$\int E\{\hat f_{\rm decon}(x) - f(x)\}^2\, dx = O\big(h^r + n^{-1} h^{-(2\alpha + 1)}\big). \tag{3.3}$$

It follows that, if f has infinitely many integrable derivatives and if the
tails of $\phi_U(t)$ decrease at no faster than a polynomial rate as $|t| \to \infty$,
then the bandwidth h can be chosen so that the mean integrated squared
error of a deconvolution kernel estimator of f, using the sinc kernel,
converges at rate $O(n^{\epsilon - 1})$ for any given $\epsilon > 0$. For example, taking
$h = n^{-\delta}$ with $\delta = \epsilon/(2\alpha + 1)$, and $r \ge 1/\delta$, makes both terms on the
right-hand side of (3.3) of order $n^{\epsilon - 1}$ or smaller.

This very fast rate of convergence contrasts with that which occurs if
the kernel K is of only finite order. For example, if K is a second-order
kernel, in which case (3.2) holds only for $r \le 4$ when L is replaced by
K, the argument at (3.3) gives:

$$\int E\{\hat f_{\rm decon}(x) - f(x)\}^2\, dx = O\big(h^4 + n^{-1} h^{-(2\alpha + 1)}\big).$$

The fastest rate of convergence of the right-hand side to zero is attained
with $h = n^{-1/(2\alpha + 5)}$, giving

$$\int E\{\hat f_{\rm decon}(x) - f(x)\}^2\, dx = O\big(n^{-4/(2\alpha + 5)}\big).$$

In fact, this is generally the best rate of convergence of mean integrated
squared error that can be obtained using a second-order kernel when
the characteristic function $\phi_U$ decreases like $|t|^{-\alpha}$ in the tails, even if
the density f is exceptionally smooth. Nevertheless, second-order kernels
are often preferred to the sinc kernel in practice, since they do not suffer
from the unwanted oscillations that afflict estimators based on the sinc
kernel.

3.2 Minimum contrast estimators, and their
relationship to deconvolution kernel estimators 

In the context of the measurement error model at (1.1), Comte et al.
(2007) suggested an interesting minimum contrast estimator of the
density f of X. Their approach has applications in a variety of other settings
(see Comte et al., 2006, 2008; Comte and Taupin, 2007), including to
the regression model at (1.2), and the conclusions we shall draw below
apply in these cases too. Therefore, for the sake of brevity we shall treat
only the density deconvolution problem.

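To describe the minimum contrast estimator in that setting, define

$$\hat a_{k\ell} = \frac{1}{2\pi n} \sum_{j=1}^n \int e^{itW_j}\, \frac{\phi_{L_{k\ell}}(-t)}{\phi_U(t)}\, dt,$$

where $\phi_{L_{k\ell}}$ denotes the Fourier transform of the function $L_{k\ell}$ defined by
$L_{k\ell}(x) = \ell^{1/2} L(\ell x - k)$, k is an integer and $\ell > 0$. In this notation the
minimum contrast nonparametric density estimator is

$$\tilde f(x) = \sum_{k=-k_0}^{k_0} \hat a_{k\ell}\, L_{k\ell}(x).$$

There are two tuning parameters, $k_0$ and $\ell$. Comte et al. (2007) suggest
choosing $\ell$ to minimise a penalisation criterion.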

The resulting minimum contrast estimator is called a penalised
contrast density estimator. The penalisation criterion suggested by Comte
et al. (2007) for choosing $\ell$ is related to cross-validation, although its
exact form, which involves the choice of additional terms and
multiplicative constants, is based on simulation experiments. It is clear on
inspecting the definition of $\tilde f$ that $\ell$ plays a role similar to that of the
inverse of bandwidth in a conventional deconvolution kernel estimator.
In particular, $\ell$ should diverge to infinity with n. Comte et al. (2007)
suggest taking $k_0 = 2^m - 1$, where $m \ge \log_2(n + 1)$ is an integer. In
numerical experiments they use m = 8, which gives good performance
in the cases they consider. More generally, $k_0/\ell$ should diverge to infinity
as sample size increases.

The minimum contrast density estimator of Comte et al. (2007) is
actually very close to the standard deconvolution kernel density estimator
at (2.2), where in the latter we use the sinc kernel at (3.1). Indeed, as
the theorem below shows, the two estimators are exactly equal on a grid,
which becomes finer as the bandwidth, h, for the sinc kernel density
estimator decreases. However, this relationship holds only for values of x
for which $|x| \le k_0/\ell$; for larger values of $|x|$ on the grid, $\tilde f(x)$ vanishes.
(This property is one of the manifestations of the fact that, as noted
earlier, $k_0$ and $\ell$ generally should be chosen to depend on sample size in
such a manner that $k_0/\ell \to \infty$ as $n \to \infty$.)

Theorem. Let $\hat f_{\rm decon}$ denote the deconvolution kernel density estimator
at (2.2), constructed using the sinc kernel and employing the bandwidth
$h = \ell^{-1}$. Then, for any point x = hk with k an integer, we have

$$\tilde f(x) = \begin{cases} \hat f_{\rm decon}(x) & \text{if } |x| \le k_0/\ell, \\ 0 & \text{if } |x| > k_0/\ell. \end{cases}$$

A proof of the theorem will be given in section 3.3. Between grid points
the estimator $\tilde f$ is a nonstandard interpolation of values of the kernel
estimator $\hat f_{\rm decon}$. Note that, if we take $h = \ell^{-1}$, the weights
$L(\ell x - k) = L\{(x - hk)/h\}$ used in the interpolation decrease quickly as k moves
further from x/h, and, except for small k, neighbour weights are close in
magnitude but differ in sign. (Here L is the sinc kernel defined at (3.1).)
In effect, the interpolation is based on rather few values $\hat f_{\rm decon}(k/\ell)$,
corresponding to those k that are close to x/h.
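
The grid identity in the theorem can be illustrated numerically. The sketch below is not the procedure of Comte et al. (2007): it writes the minimum contrast estimator as the interpolation at (3.4), derived in section 3.3, rather than computing the coefficients directly. It also assumes Laplace errors with scale `sigma`, evaluates the sinc deconvolution kernel by the trapezoid rule, and uses hypothetical function names. NumPy's `np.sinc(y)` equals $\sin(\pi y)/(\pi y)$, which is the kernel L.

```python
import numpy as np

def sinc_decon_kernel(x, h, sigma):
    """L_U for Laplace(sigma) errors (an illustrative assumption):
    (1/pi) * int_0^pi cos(t x) (1 + sigma^2 t^2 / h^2) dt, by the trapezoid rule."""
    t = np.linspace(0.0, np.pi, 801)
    g = np.cos(np.multiply.outer(x, t)) * (1.0 + (sigma * t / h) ** 2)
    dt = t[1] - t[0]
    return dt * (g.sum(axis=-1) - 0.5 * (g[..., 0] + g[..., -1])) / np.pi

def f_decon(x, w, h, sigma):
    """Sinc-kernel deconvolution density estimator at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - w[None, :]) / h
    return sinc_decon_kernel(u, h, sigma).sum(axis=1) / (len(w) * h)

def f_contrast(x, w, ell, k0, sigma):
    """Minimum contrast estimator written as the interpolation (3.4)."""
    k = np.arange(-k0, k0 + 1)
    grid_values = f_decon(k / ell, w, 1.0 / ell, sigma)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    weights = np.sinc(ell * x[:, None] - k[None, :])   # L(ell * x - k)
    return weights @ grid_values

rng = np.random.default_rng(3)
sigma, ell, k0 = 0.2, 3.0, 12
w = rng.normal(size=200) + rng.laplace(scale=sigma, size=200)
inside = np.arange(-k0, k0 + 1) / ell          # grid points with |x| <= k0/ell
outside = np.array([k0 + 1, k0 + 5]) / ell     # grid points with |x| > k0/ell
same = np.allclose(f_contrast(inside, w, ell, k0, sigma),
                   f_decon(inside, w, 1.0 / ell, sigma), atol=1e-10)
zero = np.allclose(f_contrast(outside, w, ell, k0, sigma), 0.0, atol=1e-10)
```

Both checks hold up to rounding error, because the weights reduce to an indicator at grid points. Between grid points the two functions differ, and evaluating `f_contrast` and `f_decon` on a finer mesh shows how small that difference is.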

In practice the two estimators are almost indistinguishable. For
example, Figure 3.1 compares them using the bandwidth that minimises
the integrated squared difference between the true density and the
estimator, for one generated sample in the case where X is normal N(0, 1),
U is Laplace with var(U)/var(X) = 0.1, and n = 100 or n = 1000. In
the left graphs the two estimators can hardly be distinguished. The right
graphs show magnifications of these estimators for $x \in [-0.5, 0]$. Here it
can be seen more clearly that the minimum contrast estimator is an
approximation of the deconvolution kernel estimator, and is exactly equal
to the latter at x = 0.

These results highlight the fact that the differences in performance 
between the two estimators derive more from different tuning para- 
meter choices than from anything else. In their comparison, Comte et 
al. (2007) used a minimum contrast estimator with the sinc kernel L
and a bandwidth chosen by penalisation, whereas for the deconvolution 
kernel estimator they employed a conventional second-order kernel K 
and a different bandwidth-choice procedure. Against the background of 
the theoretical analysis in section 3.1, the different kernel choices (and 
different ways of choosing smoothing parameters) explain the differences 
observed between the penalised contrast density estimator and the de- 
convolution kernel density estimator based on a second-order kernel. 

Figure 3.1 Deconvolution kernel density estimator (DKDE) and
minimum contrast estimator (PCE) for a particular sample of size
n = 100 (upper panels) or n = 1000 (lower panels) in the case
var(U)/var(X) = 0.1. Right panels show magnifications of the
estimates for $x \in [-0.5, 0]$ in the respective left panels.



3.3 Proof of Theorem 

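Note that $\phi_{L_{k\ell}}(t) = \ell^{-1/2} \exp(itk/\ell)\, \phi_L(t/\ell)$ and

$$\hat a_{k\ell} = \frac{\ell^{-1/2}}{2\pi n} \sum_{j=1}^n \int_{-\pi\ell}^{\pi\ell} \frac{\exp\{-it(k\ell^{-1} - W_j)\}}{\phi_U(t)}\, dt.$$

Therefore,

$$\begin{aligned}
\tilde f(x) &= \sum_{k=-k_0}^{k_0} \hat a_{k\ell}\, L_{k\ell}(x) \\
&= \frac{1}{2\pi n} \sum_{k=-k_0}^{k_0} L(\ell x - k) \sum_{j=1}^n \int_{-\pi\ell}^{\pi\ell} \frac{\exp\{-it(k\ell^{-1} - W_j)\}}{\phi_U(t)}\, dt \\
&= \sum_{k=-k_0}^{k_0} L(\ell x - k)\, \hat f_{\rm decon}(k/\ell).
\end{aligned} \tag{3.4}$$

If r is a nonzero integer then L(r) = 0. Therefore, if $x = sh = s/\ell$ for an
integer s then $L(\ell x - k) = 0$ whenever $k \ne s$, and $L(\ell x - k) = 1$ if k = s.
Hence, (3.4) implies that $\tilde f(x) = \hat f_{\rm decon}(x)$ if $|s| \le k_0$, and $\tilde f(x) = 0$
otherwise.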